adversarial bandit
Online EXP3 Learning in Adversarial Bandits with Delayed Feedback
Consider a player that in each of $T$ rounds chooses one of $K$ arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays $\{d_t\}$ that are unknown to the player. After picking an arm at round $t$, the player receives the cost of playing this arm $d_t$ rounds later; in cases where $t + d_t > T$, this feedback is simply missing. We prove that the EXP3 algorithm (that uses the delayed feedback upon its arrival) achieves a regret of $O\left(\sqrt{\ln K\left(KT+\sum_t d_t\right)}\right)$. For the case where $\sum_t d_t$ and $T$ are unknown, we propose a novel doubling trick for online learning with delays and prove that this adaptive EXP3 achieves a regret of $O\left(\sqrt{\ln K\left(K^{2}T+\sum_t d_t\right)}\right)$. We then consider a two-player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough such that players no longer enjoy the "no-regret property" (e.g., where $d_t = O(t\log t)$), the ergodic average of the strategy profile still converges to the set of Nash equilibria of the game. The result is made possible by choosing an adaptive step size $\eta_t$ that is not summable but is square summable, and proving a "weighted regret bound" for this general case.
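To make the delayed-feedback mechanism concrete, here is a minimal Python sketch of EXP3 where the cost observed for the arm played at round $t$ is queued and only applied as an importance-weighted update once it arrives $d_t$ rounds later; feedback that would arrive after round $T$ is dropped. The fixed step size and the `cost_fn`/`delay_fn` callbacks are illustrative assumptions, not the paper's exact construction.

```python
import math
import random
from collections import defaultdict

def exp3_delayed(T, K, cost_fn, delay_fn, seed=0):
    """Sketch of EXP3 with delayed feedback: the cost of the arm played
    at round t is revealed d_t rounds later (lost if that is after T)."""
    rng = random.Random(seed)
    eta = math.sqrt(math.log(K) / (K * T))  # fixed step size (illustrative choice)
    log_w = [0.0] * K                  # log-weights, for numerical stability
    pending = defaultdict(list)        # arrival round -> [(arm, cost, prob_at_play)]
    plays = []
    for t in range(T):
        # Turn log-weights into a sampling distribution.
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        s = sum(w)
        p = [wi / s for wi in w]
        a = rng.choices(range(K), weights=p)[0]
        plays.append(a)
        # Adversary fixes a cost in [0, 1] and a delay d_t; queue the feedback.
        cost, d = cost_fn(t, a), delay_fn(t)
        if t + d < T:                  # feedback arriving after round T is simply missing
            pending[t + d].append((a, cost, p[a]))
        # Apply every piece of feedback arriving at this round, using the
        # sampling probability from the round the arm was actually played.
        for arm, c, prob in pending.pop(t, []):
            log_w[arm] -= eta * c / prob   # importance-weighted cost estimate
    return plays
```

Note that the update divides by the probability the arm had when it was played, not the current one; that is what keeps the cost estimate unbiased under delay.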
A Gang of Adversarial Bandits
We consider running multiple instances of multi-armed bandit (MAB) problems in parallel. A main motivation for this study is online recommendation systems, in which each of $N$ users is associated with a MAB problem and the goal is to exploit users' similarity in order to learn users' preferences over $K$ items more efficiently. We consider the adversarial MAB setting, whereby an adversary is free to choose which user and which loss to present to the learner during the learning process. Users are in a social network, and the learner is aided by a-priori knowledge of the strengths of the social links between all pairs of users. It is assumed that if the social link between two users is strong then they tend to share the same action. The regret is measured relative to an arbitrary function which maps users to actions. The smoothness of the function is captured by a resistance-based dispersion measure $\Psi$. We present two learning algorithms, GABA-I and GABA-II, which exploit the network structure to bias towards functions of low $\Psi$ values.